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Abstract 

Identification of every single genome present in a microbial sample is an important and 
challenging task with crucial applications. It is challenging because there are typically millions 
of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate 
method to date is exhaustive single cell sequencing using multiple displacement amplification, 
which is simply intractable for a large number of cells. However, there is hope for breaking this 
barrier as the number of different cell types with distinct genome sequences is usually much 
smaller than the number of cells. 

Here, we present a novel divide and conquer method to sequence and de novo assemble 
all distinct genomes present in a microbial sample with a sequencing cost and computational 
complexity proportional to the number of genome types, rather than the number of cells. The 
method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated 
data. The proposed divide and conquer method successfully reduces the cost of sequencing in 
comparison with the naive exhaustive approach. 

Availability: Squeezambler and datasets are available under 



http : / / compbio . cs . wayne . edu/ software/ squeezambler/ 
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1 Introduction 



Critical applications, including the Human Microbiome Project ( Methe et al. . 20121 ). biothreat 
detection, and combating antibiotic resistant pathogens, necessitate identification of all distinct 
genome sequences in a bacterial sample. When prior knowledge is available about which organ- 
isms may be present in the sample, flow cytometry and 16S rRNA gene sequencing may be used. 
However, metagenomics is usually used for analyzing the genomics of relatively abundant microbes 
when no prior knowledge is given. Metagenomics consists in the study of the variation of species 
in a complex microbial sample. Since the vast majority of environmental bacteria elude cultiva- 
tion, metagenomics investigates microbial communities by sequencing sampled genome fragments 
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Figure 1: The divide and conquer algorithm for an example with 10 cells and 3 distinct genomes 
shown in different colors. Each row corresponds to one sequencing round. The number of barcodes 
in each round is the number of blue boxes in the corresponding row. 



without the need for culturing. Such a heterogeneous pool of sequencing r eads can also be assem- 



bled to yield a superposition of highly abundant genomes in the sample (jTreangen et all 120131 ) 



There are two problems with metagenomics: (i) the resulting assembly contains multiple genomes 
superimposed, and (ii) only highly abundant genomes survive the sampling process. 

Advances in DNA amplification technology have enabled whole genome sequencing directly 
from individual cells without requiring growth i n culture. Single cell sequencing methods have en- 



abled investigation of novel uncultured microbes (jKvist et all 120071 ; iMussmann et all 120071 ) . These 



culture-independent single cell studies are a powerful alternative to metagenomics studies. Genomic 
seque ncing from single ba c terial genomes was first demonstrated with cells isolated by flow cytom 



2002, 



etry ( Raehunat han et al. 



2001 



Hosono et al 




using multiple displacement amplification (MDA) (IPean et al. 



I). MP A is now the pre ferred method for whole genome amplifica- 



tion from single cells ( Lasken , 2007 ;_ Ishoev et al. , 20081 ) . The first attempt to assemble a complete 
bacterial genome from one cell ( Zhang et al. , 20061 ) further explored the challenges of assembly from 
amplified PNA, including amplification bias and chime ric PNA rearrangements. A mplification bias 



results in orders of magnitude difference in coverage ( Raghunathan et al. . 20051 ) . and absence of 



coverage in some regions. Chimera formation occurs during the PNA branching process by which 
the 4>29 PNA polymerase generates PNA amplification in MPA (Lasken and Stockwell, 2007] ) . Sub- 



sequent studies continued to improve sing l e cell assemb l ies (IMarcv et al. 



Hongoh et all 120081 : iRodrigue et all 12003 : IWoyke et all 120091) . A nearly full pot ential of single cell 



2007; Podar et al. 



2007; 



20 111 followed by those 



ge nome assembly has rec entl y been reali z ed by the work of lChitsaz et al 
of lBankevich et all I2OI2I and IPeng et oil . 12012 . 

Pue to recent progress in single cell PNA amplification techniques and de novo assembly al- 
gorithms, the genomes of all bacterial species in a sample can be captured one cell at a time. 
However, there are often millions of cells per sample, in which case the naive deep sequencing of 
every cell becomes prohibitively costly. Moreover, it is expected that deep sequencing of every cell 
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is often not necessary as the majority of biologica lly important samples are sparse in the sense that 
many cells are biological replicates. Compressed ( Candes and Tao . 2005 . 2006 : Donoho . 20061) and 



distil led (adaptive sampling-and-refinement) sensing methods (jHaupt et all 



2011 



Wei and Herd . 



20121 ) have been proposed in the last decade to exploit sparsity for reducing the cost of sens- 



i ng and reconstructing signals in various spac es, ranging from Banach spaces to Boolean algebras 
(jErlich et all l20lrt : IStobbe and Krausel . l2012h . Inspired by those advances, we give an algorithm 
in this paper that exploits sample sparsity to reduce the cost of sequencing without compromising 
the accuracy of identification of all distinct genomes, even the ones that are minimally represented 
in the sample. 



2 Approach 

A naive approach to solve the problem, which we call single cell co-assembly strategy, is to amplify 
the genome of each cell, barcode them individually, pool them, sequence in one sequencing run, 
and demultiplex based on the barcode. In this approach, each cell should be isolated and its DNA 
extracted and amplified using multiple displacement amplification. Although there is currently 
no high throughput device to perform these processes, one could envision automated microfluidic 
devices that will be capable of high throughput separation, DNA extraction, amplification, and 
barcoding of single ce lls in the near future. T he output sequencing reads could then be co-assembled 
using our tool HyDA (|Movahedi et all l2012h . In HyDA, the read dataset of each cell is assigned a 



unique color. All the colors are co-assembled in one colored de Bruijn graph. This approach requires 
enough unique barcodes to tag the fragments of each cell. Also, barcodes attached to each fragment 
are sequenced, which imposes additional sequencing cost. Fabrication of so many unique barcodes 
becomes prohibitively expensive for a large number of cells and therefore, this naive approach will 
not work. 

The number of dist inct genomes in a microbial community is often much less than the number 



of cells. For example, iQin et all . I201Q estimated the number of detected distinct species in the 
human gut to be in the order of 1,000, while the number of microbial cells in a human body, 
most of which reside in the gut, is in the order of 100 trillion. We call this effect the sparsity 
of distinct genomes in a sizable microbial population. The naive approach does not exploit the 
sparsity to reduce the cost of sequencing. Here, we proposed to exploit the spareness by adopting 
a divide and conquer strategy to reduce the amount of required barcodes and sequencing. After 
isolation of each cell and extraction of the DNA, every DNA is amplified and stored separately, 
e.g. in a microfluidic droplet. The main idea is to sample the amplified DNAs adaptively, which is 
essentially allocating sequencing and barcoding resources dynamically over the course of multiple 
sensing iterations. Initially, the algorithm has one group of cells, which is the entire sample. In each 
iteration, each group is divided into two equally-sized subgroups. A small amount of DNA from 
each cell in a subgroup is sampled, pooled, and sequenced. In practice, one barcode per subgroup 
is used to multiplex and demultiplex the sequencing in one run. The amount of sampling from each 
cell is computed based on the res ults of previous iter atio ns. This is called res ource allocation and 
is similar to the one proposed by Haupt et al , 2011 and Wei and Hero . 20121 . The resulting read 



datasets, one per subgroup, are then co-assembled and compared using HyDA to decide pairwise 
subsumption of subgroups. The cells in those subgroups that are subsumed by other subgroups, 
even in previous iterations, are eliminated from further analysis. This procedure is continued until 
each remaining group contains only one cell. The resulting non-redundant single element groups 
capture all of the distinct genome sequences. 
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3 Methods 



Although distinct genomes are often identified as different species, there are numerous cases where 
distinct genomes are categorized as varied instances of the same species or even the same strain. 
Instead of identification of strains and species which are currently phenotypic notions, the goal 
of our approach is to find all distinct genomes in a sample. We define two genome sequences to 
be distinct if the ratio of their differences over the whole genome size is above a threshold. That 
threshold is input by the user and controls a trade-off between sensitivity and specificity. 

Co-assembly and comparison of multiple input read datasets lie at the core of both approaches 
we take in this paper. W hile there are assembly to o ls for single cell genomic data, such as SPAdes 
(jBankevich et all , boij ) and IDBA-UD rtPeng et all, hoij) and also co-assembly tools for normal 



multicell genomic data such as Co rtex (llabal et al. 



date that has both functionalities ( Movahedi et al. 



, 2012), we use HyDA which is the only tool to 
However, the novel ideas proposed here 



can also be implemented using other assemblers. For the sake of completeness, HyDA algorithm is 
summarized in the following. 



3.1 Co-assembly Algorithm 

3.1.1 Construction and Condensation of the Colored De Bruijn Graph 

The colored de Bruijn graph, a variation of the standard de Bruijn graph, is a combinatorial 
structure that can be used to assemble a numb er of input read d atasets, each represented by a 
color, superimposed on a single de Bruin graph ( Iqbal et al. . 20121 ). The output of such assembly 
methods is a number of assembled sequences (contigs) and the corresponding average multiplicity 
in each color. Our de Bruijn graph of the input reads is stored in a hashed collection of splay 
trees whose vertices are fe-mers with an array of multiplicity counts (one entry per color), in- and 
out-edges, and internal flags. A 1-in 1-out chain of /c-mers is condensed into an equivalent long 
sequence which is called a unitig. A maximal unitig, which cannot be extended further due to 
a branch in the graph, is a contig. Note that in HyDA, our condensation is solely based on the 
topology of the graph without any attention to the colored multiplicities. Ignoring multiplicities 
for condensation is purposefully done, and constitutes the feature that all ows the assembler to deal 



2011 



with black out regions in single cell multiple displacement amplification (jChitsaz et al. 

Contigs with low coverage are often caused by sequencing error. The low coverage contig 
removal process is iterated with an increased cutoff in each round. In each iteration, those contigs 
whose maximum coverage over all colors is less than the cutoff are eliminated, and the remaining 
graph is recondensed. This causes some contigs to merge into larger ones with recomputed average 
coverages. This process is similar to Velvet-SC's low coverage contig removal, but instead of 
considering one average coverag e per contig, HyDA c onsiders the maximum of average coverages 



for all the colors of each contig (jChitsaz et all l201ll ). In this case, only those contigs that have 



low coverage in all colors are considered erroneous and removed. Another possible approach is to 
eliminate those contigs for which the mean of average coverages for all colors is less than the cutoff. 
However, if we were to follow this approach, a contig that is well covered in one color but is poorly 
covered or absent in hundreds of colors would be lost since the mean would dilute the effect of 
coverage in one color among hundreds of colors. This approach would not work for us here because 
our goal is to be able to preserve rare contigs. 
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3.1.2 Redundancy Removal 

To remove redundant genomes, we define a relation that is reminiscent of subset relation on the set of 
contigs for each color. Note that our goal here is to remove redundant genomes, which are collections 
of contigs, rather than to remove individually redundant contigs. Let A = {a±, a 2 , ■ ■ ■ , a r } be the 
set of remaining contigs after iterative error removal. Let Mj{ai) denote the average coverage of 
contig a, in color j, for 1 < i < r and 1 < j < s, where s is the number of colors. Pick e > and let 
Aj = {aj € A | Mj(cii) > e} C A be the set of contigs for each color j. The parameter e determines 
the trade-off between specificity and sensitivity. We chose e = in this study, but a non-zero e 
might be needed if there are erroneous or contaminant fc-mers in one color that also occur in the 
true genomic sequence of another color. 

We define D T (Ai, Aj) G M on the set T = {A 1 ,A 2 , ... , A s } as: 

D T (A i ,A j )=r-^^-, (1) 

in which Ai\Aj = {a € Ai\a ^ Aj}, and || • || denotes the total assembly size. We define: 

Ai< T AjiSO<D T (Ai,Aj), (2) 

in which r > 0. Particularly, becomes the subset relation and can be used to detect and remove 
redundant collections of contigs, i.e. those that are subsumed by a larger collection. However 
in reality, the mathematical subset relation is not adequate as there are various types of noise 
including sequencing errors, intrastrain variations such as SNPs and indels, contaminations added 
in the amplification and sequencing process, and lack of coverage in some regions caused by the 
MDA. Hence, the definition of subset should be loosened by choosing a small but nonzero value 
for r. Beside e, the value of r gives a trade-off between specificity and sensitivity of recognizing 
distinct genomes. If r is small, the algorithm detects two equivalent genomes as distinct, and if 
r is large, distinct genomes are declared equivalent. To see how r is chosen, refer to Sections 13.21 
and 14.21 The results are shown in Table El Finally, we compute a non-redundant set of assemblies 
£ = {A^, Ai 2 , . . . , Ai t } C T such that for every distinct pair 1 < a, b < t, Ai a ^ T Ai b and 
Ai b Ai a . 

3.2 Divide and Conquer Strategy Exploiting Sparsity 

Let n be the number of cells in the sample, and denote the cells by S z , i = 1, . . . , n. Our algorithm 
aims to assemble all of the distinct genomes and identify at least one cell per distinct genome. To 
reach this goal, our algorithm iteratively pools samples of amplicons from different cells, tags each 
pool with a unique barcode, mixes the barcoded pools, and has the result sequenced. The objective 
is to minimize the total number of bases required to be sequenced as well as the number of different 
barcodes needed. 

In the first iteration, we divide the n cells S , . . . , S n into two sets 

h A = {S\...,S^}, 

I 1>2 = {S [n/2i+ \...,S n }. 

Our algorithm samples equal amount of amplicons from each cell in In and The amplicons in 
each set are pooled and tagged by two distinct barcodes. The barcoded amplicons are sequenced 
to reach a desired number of base pairs. This number is an input parameter of our algorithm. 
We define the total number of base pairs sequenced from to be bi^, for i = 1,2. The two 
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Table 1: The 9 chosen species for our 


simulation. 






NCBI ID 


Name 


Ref. Status 


Size (bps) 


No. of Cells 


NCL004663.1 


Bacteroides thetaiotaomicron VPI-5482 chromosome 


complete 


6.29 M 


23 


NCL009614.1 


Bacteroides vulgatus ATCC 8482 chromosome 


complete 


5.16 M 


7 


NC.009615.1 


Parabacteroides distasonis ATCC 8503 chromosome 


complete 


4.81 M 


8 


NC.008532.1 


Streptococcus thermophilus LMD-9 


complete 


1.86 M 


2 


NC.016776.1 


Bacteroides fragilis 638R 


complete 


5.37 M 


1 


FP929046.1 


Faecalibacterium prausnitzii SL3/3 


draft 


3.21 M 


12 


FP929051.1 


Ruminococcus bromii L2-63 


draft 


2.25 M 


35 


FP929053.1 


Ruminococcus sp. SRI/5 


draft 


3.55 M 


12 


FP929055.1 


Ruminococcus torques L2-14 


draft 


3.34 M 


15 



read datasets are co-assembled by HyDA using two colors. The result is two sets of contigs for each 
color, A\ t \ and A\^- We calculate D T1 (Ai^, Ai^) and D Tl (-Ai,2, A\ t \) as defined in ([T]), in which 
t\ = t/ maxj \hj\, r is an input parameter, and | • | is the set cardinality. We choose r to be the 
maximum allowable difference between the assembly of two single cells from the same strain. Based 
on these values, we decide if the relations A\^ < Tl A\^ and A\p < Tl A\^\ hold. If A\^\ is a subset 
of Ai^, then all of the distinct genomes in In are present in ii^, therefore, the cells in I\\ do not 
need further sampling. This applies to 1%^ too, if A\p is a subset of Ai t i. If both relations hold, 
one of I\ \ or 2i 2 is eliminated arbitrarily from further analysis. Each remaining set Ji . is divided 
into two subsets for analysis in the second iteration. Fig. [1] depicts an example of 10 cells with 3 
distinct genomes shown in different colors. 

The same splitting process occurs in the subsequent iterations. Assume Jj i, . . . , ij jTOj are the 
remaining sets in iteration i. Each set Jjj is sampled to produce bij base pairs, barcoded uniquely, 
pooled, and sequenced. All of the new sequence datasets and those obtained in all previous iterations 
are co-assembled. In the co-assembly, the previous datasets help to improve the assembly of the new 
ones. The resulting contig set of Tjj is denoted by Aij, For all j, k = 1 . . . mj, the relations Aij -< Ti 
Aik are evaluated, where r, is a threshold whose calculation will be explained below. The cells in 
those Jj j whose assemblies are subsumed will be removed from further analysis. All the remaining 
ones are partitioned into two disjoint subsets. Denote the new subsets by -Zj+^i, . . . , -/j+i jmi+1 . Note 
that in iteration i, rrii unique barcodes are needed. Therefore, 

m = max mi (3) 

i 

is the maximum number of barcodes required for the entire algorithm. 

Parameters hi j play a key role in the algorithm. We propose an adaptive calculation of b j j to 
minimize, without losing accuracy, the total base pairs sequenced: 

& = !>,;• W 

Assume Jjj is a set that is created by dividing the set Ii-ik in iteration % — 1 into two. We are 
motivated to choose the total number of sequenced base pairs from cells in Ij j to be proportional 
to the total assembly size ).\\, i.e. 

b i:j = c x (5) 

where c is an input parameter indicating the estimated average coverage of each iteration. We say 
Aij are accurate enough if the partial order relation < Ti can be assessed accurately. If c is large 
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and the assembly of iteration i — 1 is accurate enough, then in iteration i, adequate base pairs are 
sequenced to allow an accurate enough assembly of set Iij. 

Another factor that affects accuracy of the assessment of these relations is the choice of Tj. 
The threshold 73 in the i th iteration is used to detect cells with similar genomes in spite of small 
differences in their assemblies. We propose to use the following threshold: 

n = Ft i - (6) 

max l<j<m 4 \H,j\ 

Recall that r is the maximum allowable difference between the assembly of two single cells with 
similar genome sequences. To account for the worst possible case, it is assumed that there are 
distinct genomes in each group Jjj. Therefore, maxi<j< mi \Iij\ captures the maximum number 
of distinct genomes in Ijj from any Zjfc. With these assumptions, 75 is a conservative threshold. 
This threshold will guarantee that two distinct genomes are detected, but it has the possibility 
of detecting similar genomes as distinct. In the last iteration of the algorithm, when every group 
consists of one cell, the threshold is r. Note that the number of iterations, which is the number of 
sequencing rounds, is always [~log 2 n~|. 



3.3 Implementation 

We implemented our algorithm in a tool called Squeezambler 1 . in C++. Our tool and datasets 
are available under [http : // compbio . cs . wayne . edu / software / squeezambler / 1 



4 Results 



Since we did not have acce ss to real data, we te s ted o ur algorithm using simulated data. We 
used our tool MDAsim 1.0 (jTaghavi and Draghicil . |2012| ) to simulate 100 multiple displacement 
amplification processes (one process per cell) from 9 distinct gen omes. The output of MDAsim 1 . 
was fed into an Illumina read generator, ART ( Huang et ali 20121 ). to generate Illumina reads, with 
realistic errors, from the simulated amplicons. The set of generated reads for each cell were treated 
like a microfluidic droplet from which samples without replacement are extracted in each iteration 
of Squeezambler 1 . 0. We assume that MDA products are contamination free, which requires a 
contaminant free automated microfluidic cell sorting, amplification, and sampling device. 



4.1 Datasets 



Totally 115 MDAs were simul ated from 9 dist inct genomes chosen from the list of species found 
in a gut metagenomics study (|Qin et ad lioich . that have a complete or draft reference genome. 
Recall that we are simulating MDA from a reference genome, therefore, we needed a reference 
genome for the chosen species. Table [U summarizes the NCBI ID, name, size, reference status 
(complete or draft), and the number of cells we have simulated. The num ber of cells i s app roximately 



proportional to the abundance mean of the corresponding species in (|Oin et all \20ld ). We ran 



MDAsim 1.0 with a diverse set of parameters, one for each cell, to capture the diverse nature of 
MDA coverage bias. ART, an Illumina read generator, was then deployed to generate 100 bps 
Illumina reads from the simulated amplicons. The amplification gain of MDAsim 1.0 was 300 x 
and that of ART was 8x from which 1/8 were selected randomly to obtain a total gain of 300 x. 
We assembled the dataset of each cell individually with HyDA 1.0, and t he resulting ass e mblie s 



have between 0.1% and 4.2% missing reference bases measured by Gage (Salz berg et al 



which is similar to the real world situation with a successful MDA reaction (IChitsaz et al 



2012) 



2011 
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We made an error profile that matches the error statistics of Illumina HiSeq 2000 for ART. Using 
our profile, ART injects on average 1% error into the reads, because of which we need to eliminate 
erroneous contigs in the assembler. HyDA 1 . 0, and also its predecessor Velvet-SC, have an iterative 
algorithm to remove low coverage contigs, which is explained in Section 13.1.11 In each iteration, 
Squeezambler 1 . provides HyDA 1 . with a coverage cutoff as a percentage of the mean coverage 
of each color. That percentage is constant in all iterations. 

We designed three collections of cells, the statistics of which are summarized in Table [2j In the 
first collection, there are 62 cells with 6 distinct genomes. In this collection, we put 22 different MDA 
results from NC_004663.1 and 22 from FP929051.1 to play the role of highly abundant genomes in a 
sample as well as 1 from NC_016776.1 and 2 from NC_008532.1 to represent low abundance genomes 
in the same sample. In the first collection, the number of distinct genomes is approximately one 
tenth of the number of cells but with a wide range of abundances. The second collection is the 
sparsest collection among the three, where the number of distinct genomes is approximately one 
twentieth of the number of cells. The third collection is the most diverse of the three, where the 
number of distinct genomes is approximately one sixteenth of the number of cells. 



Table 2: Our simulation setups: (i) 62 cells; 6 species, (ii) 97 cells; 5 species, and (iii) 112 cells; 7 
species. 



Abundance (%) 

NCBI ID 62 cells; 6 species 97 cells; 5 species 112 cells; 7 species 



NC.004663.1 


22 


36% 


23 


24% 


23 


21% 


NC.009614.1 


7 


11% 





0% 


7 


6% 


NC.009615.1 


8 


13% 





0% 


8 


7% 


NC.008532.1 


2 


3% 





0% 





0% 


NC.016776.1 


1 


1% 





0% 





0% 


FP929046.1 





0% 


12 


12% 


12 


11% 


FP929051.1 


22 


36% 


35 


36% 


35 


31% 


FP929053.1 





0% 


12 


12% 


12 


11% 


FP929055.1 





0% 


15 


16% 


15 


13% 



4.2 Simulation Results 

We ran Squeezambler 1 . for the three collections described above, the results of which are sum- 
marized in Table El Most of the Squeezambler 1 . parameters were the same for all three collec- 
tions. The assembly inclusion threshold constant per cell was chosen r = 0.2 which means at most 
20% of the assembly can vary among multiple instances of the same genome. Taking into account 
the genomic sequence loss caused by the MDA, sampling of the amplicons, and sequencing errors, 
20% is a reasonable choice. This is not an optimized value and is chosen based on the authors' 
intuition. We chose r conservatively in this work so that distinct genomes are not lost but some 
equivalent genomes are detected as distinct. Finding the optimal value for r needs a thorough 
study which is beyond the scope of this paper. 

The fc-mer size for HyDA 1 . was chose n to be k = 55 which is a common choice for the chosen 



Illumina error profile (|Chitsaz et ali . 120111 ). The coverage cutoff, as a percentage of the coverage 
mean, was chosen to be 100%, and the minimum contig length was 100 bps for HyDA 1.0. The 
coverage mean is estimated based on the assembly size in the first iteration, which is often larger 
than the actual genome size due to a myriad of erroneous low coverage /c-mers. This causes the 
initially estimated coverage mean to be a small fraction of the final coverage mean after error 
removal. 
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Table 3: Squeezambler 1 .0 results for the three setups summarized in Tabled For some methods 
we report the results for two different values of initial sequencing coverage per cell. 



setup 


Method 


Sequencing 


Total 


Max 


JNo. OI 


Iterations 






per Cell in the 


Sequencing* * 


Barcodes 


Predicted 








1 st Iteration 


(bps) 




Distinct 








(.bps J 






Genomes 




oz cens, 


— '■ — i ii ui 

single ceil co-assembly 


OO 1V1 









i 
i 


6 distinct 


divide and conquer 


7 M 


3.0 G 


10 


8 


6 


genomes 














97 cells; 


single cell co-assembly 


63 M 


6.1 G 


97 


5 


1 


5 distinct 


single cell co-assembly 


31 M 


3.0 G 


97 


11 


1 


genomes 


divide and conquer 


1 M 


3.2 G 


17 


5 


7 




divide and conquer 


7 M 


2.9 G 


10 


6 


7 


112 cells; 


single cell co-assembly 


63 M 


7.1 G 


112 


7 


1 


7 distinct 


single cell co-assembly 


31 M 


3.5 G 


112 


14 


1 


genomes 


divide and conquer 


1 M 


7.1 G 


33 


11 


7 



Xi/IJul- 

**b in Q. 
***m in ([3]) 



Squeezambler 1 . has an option to choose the number of initial groups in the first iteration, 
g. If g is chosen to be equal to the number of cells, then Squeezambler 1 . simulates the single 
cell co-assembly of all the cells. If g = 2, then Squeezambler 1 . simulates the divide and conquer 
algorithm described in Section 13.21 Although experimenting with different g values may improve 
the results, we do not have data for it. 

Before any sequencing is done, the algorithm has no idea about the genome sizes, various distinct 
genomes, and abundance of each genome. Therefore, an unbiased algorithm has to sequence from 
each cell exactly the same amount right in the 1 st iteration. That amount in our algorithm, denoted 
by b\jj\l\j\, is an input parameter to Squeezambler 1 .0 which was chosen to be between 1 Mbps 
and 63 Mbps as reported in the third column of Table O In our setup, the size of the 9 distinct 
genomes varies between 1.8 Mbps and 6.3 Mbps; see Table [U Therefore, 1 Mbps sequencing 
provides between 1/6 x and 1/2 x coverage, and 63 Mbps sequencing provides between 10 x and 
35 x coverage. 

The input parameter c, which controls the amount of sequencing in subsequent rounds, was 
chosen to be c = 10, which means 10 x expected coverage from each genome in each collection. We 
observed that in practice 10 x coverage of each dis tinct genome provides sufficie nt information for 
reliable evaluation of This is consistent with the Lander and Waterman . 19881 analysis, in which 
the statistics of gaps and contigs in terms of coverage is characterized. Based on that analysis, 10 x 
coverage yields the entire genome without gap with high probability. 

Our divide and conquer algorithm exhibits significant improvement in maximum barcodes, and 
in most cases the total number of base pairs sequenced, over the single cell co-assembly method. 
For the 97 cells, 5 distinct genomes collection, our divide and conquer algorithm requires only 2.9 
Gbps sequencing and 10 barcodes in comparison to 3.0 Gbps sequencing and 97 barcodes consumed 
by the single cell co-assembly method. Similarly for the 62 cells, 6 distinct genomes collection, our 
divide and conquer algorithm requires only 3.0 Gbps sequencing and 10 barcodes in comparison to 
3.9 Gbps sequencing and 62 barcodes required by the single cell co-assembly method. Even though 
for the 112 cells, 7 distinct genomes collection, our divide and conquer algorithm outperforms single 
cell co-assembly in terms of the number of barcodes, by 33 vs. 112, it requires 7.1 Gbps sequencing 
which is more than that used by single cell co-assembly (3.5 Gbps). 

In all of our experiments, all distinct genomes were correctly detected. Therefore, our results 
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exhibit ultimate sensitivity. However, in some experiments multiple cells with similar genomes 
were identified as distinct, which is not an issue for our problem because based on the results of 
Squeezambler 1 . 0, those cells that are identified as with distinct genomes can be deeply sequenced 
and assembled afterwards. For the 62 cells, 6 distinct genomes collection, the number of detected 
distinct genomes was between 6 and 8. For the 97 cells, 5 distinct genomes collection, that number 
was between 5 and 11, and for the 112 cells, 7 distinct genomes collection, that was between 7 and 
14. This specificity is reported in the sixth column of Table [3j Note that the number of sequencing 
rounds (iterations) for single cell co-assembly is always 1, and for divide and conquer with g = 2, 
it is [log 2 n\ . 

Due to the computational intensity of MDAsim 1.0, HyDA 1.0, and Squeezambler 1.0, we 

report our results for only small examples in order to provide a proof of concept. We also chose 
our parameters conservatively, and without optimization, so that we do not compromise accuracy. 
Moreover, our examples have in the order of 100 cells and 6 distinct genomes, whereas real world 
samples are much sparser as the number of cells may be in the order of billions and the number of 
distinct genomes at most in the order of thousands. Therefore, we expect the reduction in the total 
sequencing and maximum barcodes to be higher for real world applications than what we report 
in this paper. 

5 Conclusion 

We presented an adaptive divide and conquer algorithm for distilled sequencing and de novo as- 
sembly of distinct genomes in a bacterial community, e.g. human gut microbiome. Samples derived 
from such communities are often sparse in the sense that the number of distinct genomes is much 
less than the number of cells. Our algorithm exploits sparsity to decrease the amount of sequencing 
and the number of multiplexing barcodes needed for single cell sequencing and de novo assembly. 

We implemented our algorithm in a tool which we call Squeezambler and performed simulation 
experiments to demonstrate its power. Our results show that: i) the number of required barcodes 
with our divide and conquer algorithm is less than that required by the naive approach, and that 
ii) the amount of sequencing needed remains the same or decreases. Due to the computational 
intensity of the problem, only small examples with low sparsity were studied in this work. Real 
world samples are much sparser (~ 1000 species in ~ 10 14 cells) than the examples here (~ 5 species 
in ~ 100 cells). Also, the parameters used to run our tool were chosen conservatively and without 
optimization. Therefore, we expect the improvement of our algorithm to be higher than what we 
reported in this paper in real world situations. Squeezambler 1 . identifies all distinct genomes 
in the sample which are candidates for different strains/species. Those cells that are identified 
as having distinct genomes need to be subsequently deeply sequenced and assembled in order to 
obtain a more detailed assembly. 
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